AITopics | upper confidence bound

Collaborating Authors

upper confidence bound

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning

Neural Information Processing SystemsJun-12-2026, 16:21:38 GMT

Monte Carlo Tree Search (MCTS), which leverages Upper Confidence Bound for Trees (UCTs) to balance exploration and exploitation through randomized sampling, is instrumental to solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to efficiently explore and exploit during tree search. To this end, we propose MALinZero, a new approach to leverage low-dimensional representational structures on joint-action returns and enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into the low-dimensional space representable using a contextual linear bandit problem formulation. We solve the contextual linear bandit problem with convex and $\mu$-smooth loss functions -- in order to place more importance on better joint actions and mitigate potential representational limitations -- and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose an $(1-\tfrac1e)$-approximation algorithm for the joint action selection by maximizing a sub-modular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.

artificial intelligence, name change, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

Bootstrapping Upper Confidence Bound

Neural Information Processing SystemsFeb-6-2026, 04:11:25 GMT

Upper Confidence Bound (UCB) method is arguably the most celebrated one used in online decision making with partial information feedback. Existing techniques for constructing confidence bounds are typically built upon various concentration inequalities, which thus lead to over-exploration. In this paper, we propose a non-parametric and data-dependent UCB algorithm based on the multiplier bootstrap. To improve its finite sample performance, we further incorporate second-order correction into the above construction. In theory, we derive both problem-dependent and problem-independent regret bounds for multi-armed bandits under a much weaker tail assumption than the standard sub-Gaussianity. Numerical results demonstrate significant regret reductions by our method, in comparison with several baselines in a range of multi-armed and linear bandit problems.

artificial intelligence, data mining, machine learning, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining > Big Data (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.44)

Add feedback

Adaptive Budgeted Multi-Armed Bandits for IoT with Dynamic Resource Constraints

Vaishnav, Shubham, Donta, Praveen Kumar, Magnússon, Sindri

arXiv.org Artificial IntelligenceAug-15-2025

Internet of Things (IoT) systems increasingly operate in environments where devices must respond in real time while managing fluctuating resource constraints, including energy and bandwidth. Yet, current approaches often fall short in addressing scenarios where operational constraints evolve over time. To address these limitations, we propose a novel Budgeted Multi-Armed Bandit framework tailored for IoT applications with dynamic operational limits. Our model introduces a decaying violation budget, which permits limited constraint violations early in the learning process and gradually enforces stricter compliance over time. We present the Budgeted Upper Confidence Bound (UCB) algorithm, which adaptively balances performance optimization and compliance with time-varying constraints. We provide theoretical guarantees showing that Budgeted UCB achieves sublinear regret and logarithmic constraint violations over the learning horizon. Extensive simulations in a wireless communication setting show that our approach achieves faster adaptation and better constraint satisfaction than standard online learning methods. These results highlight the framework's potential for building adaptive, resource-aware IoT systems.

artificial intelligence, constraint, constraint-based reasoning, (16 more...)

arXiv.org Artificial Intelligence

2505.0264

Country: Europe > Sweden (0.14)

Genre: Research Report (0.82)

Industry:

Energy (0.93)
Information Technology > Smart Houses & Appliances (0.76)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)

Add feedback

Reviews: Bootstrapping Upper Confidence Bound

Neural Information Processing SystemsJan-23-2025, 05:47:27 GMT

I should be acknowledged that it is significantly more complex that UCB1 for example. Indeed at each time step B bootstrap repetitions are needed to estimated the bootstrapped quantiles, and each of them require to drawn n_k random variables for each arm k (the values of w's). Also, this requires to store the past rewards obtained on all arms, which requires a lot a memory. This constraint is also needed for the empirical KL-UCB mentioned above, which is one more reason to compare the two algorithms that have similar complexity. From Theorem 2, I guess that the w's are Rademacher random variables, but it would be good to specify this in the statement of the algorithm.

algorithm, bootstrapping upper confidence bound, upper confidence bound, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)

Add feedback

Bootstrapping Upper Confidence Bound

Neural Information Processing SystemsOct-9-2024, 21:53:57 GMT

bootstrapping upper confidence bound, upper confidence bound

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining > Big Data (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.40)

Add feedback

On Lai's Upper Confidence Bound in Multi-Armed Bandits

Ren, Huachen, Zhang, Cun-Hui

arXiv.org Machine LearningOct-3-2024

In this memorial paper, we honor Tze Leung Lai's seminal contributions to the topic of multi-armed bandits, with a specific focus on his pioneering work on the upper confidence bound. We establish sharp non-asymptotic regret bounds for an upper confidence bound index with a constant level of exploration for Gaussian rewards. Furthermore, we establish a non-asymptotic regret bound for the upper confidence bound index of Lai (1987) which employs an exploration function that decreases with the sample size of the corresponding arm. The regret bounds have leading constants that match the Lai-Robbins lower bound. Our results highlight an aspect of Lai's seminal works that deserves more attention in the machine learning literature.

lai, ucb index, upper confidence bound, (14 more...)

arXiv.org Machine Learning

2410.02279

Country:

North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Illinois (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

The Multi-fidelity Multi-armed Bandit, Jeff Schneider

Neural Information Processing SystemsMar-12-2024, 09:16:16 GMT

We study a variant of the classical stochastic K-armed bandit where observing the outcome of each arm is expensive, but cheap approximations to this outcome are available. For example, in online advertising the performance of an ad can be approximated by displaying it for shorter time periods or to narrower audiences. We formalise this task as a multi-fidelity bandit, where, at each time step, the forecaster may choose to play an arm at any one of M fidelities.

approximation, fidelity, mf-ucb, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Industry: Marketing (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.65)

Add feedback

Adaptive Proximal Policy Optimization with Upper Confidence Bound

Zhang, Ziqi, Xu, Jingzehua, Zhuang, Zifeng, Liu, Jinxin, wang, Donglin

arXiv.org Artificial IntelligenceDec-12-2023

Trust Region Policy Optimization (TRPO) attractively optimizes the policy while constraining the update of the new policy within a trust region, ensuring the stability and monotonic optimization. Building on the theoretical guarantees of trust region optimization, Proximal Policy Optimization (PPO) successfully enhances the algorithm's sample efficiency and reduces deployment complexity by confining the update of the new and old policies within a surrogate trust region. However, this approach is limited by the fixed setting of surrogate trust region and is not sufficiently adaptive, because there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process, truncating the ratio of the new and old policies within surrogate trust region can ensure that the algorithm achieves its best performance, therefore, exploring and researching a dynamic clip bound for improving PPO's performance can be quite beneficial. To design an adaptive clipped trust region and explore the dynamic clip bound's impact on the performance of PPO, we introduce an adaptive PPO-CLIP (Adaptive-PPO) method that dynamically explores and exploits the clip bound using a bandit during the online training process. Furthermore, ample experiments will initially demonstrate that our Adaptive-PPO exhibits sample efficiency and performance compared to PPO-CLIP.

adaptive-ppo, old policy, trust region, (14 more...)

arXiv.org Artificial Intelligence

2312.07624

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.56)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

Efficient Cluster Selection for Personalized Federated Learning: A Multi-Armed Bandit Approach

Ni, Zhou, Hashemi, Morteza

arXiv.org Artificial IntelligenceOct-29-2023

Federated learning (FL) offers a decentralized training approach for machine learning models, prioritizing data privacy. However, the inherent heterogeneity in FL networks, arising from variations in data distribution, size, and device capabilities, poses challenges in user federation. Recognizing this, Personalized Federated Learning (PFL) emphasizes tailoring learning processes to individual data profiles. In this paper, we address the complexity of clustering users in PFL, especially in dynamic networks, by introducing a dynamic Upper Confidence Bound (dUCB) algorithm inspired by the multi-armed bandit (MAB) approach. The dUCB algorithm ensures that new users can effectively find the best cluster for their data distribution by balancing exploration and exploitation. The performance of our algorithm is evaluated in various cases, showing its effectiveness in handling dynamic federated learning scenarios.

algorithm, ducb algorithm, new user, (12 more...)

arXiv.org Artificial Intelligence

2310.19069

Country: